PEARSONR

Overview

The PEARSONR function calculates the Pearson correlation coefficient (often denoted as r) and an associated p-value to test whether two datasets are linearly related. Developed by Karl Pearson in the 1890s, this statistic is one of the most widely used measures of linear association in statistics and data analysis. For more details, see the SciPy documentation and the Pearson correlation coefficient Wikipedia article.

The Pearson correlation coefficient measures the strength and direction of the linear relationship between two variables. The coefficient ranges from −1 to +1, where +1 indicates a perfect positive linear relationship, −1 indicates a perfect negative linear relationship, and 0 indicates no linear correlation. This function uses the SciPy library’s scipy.stats.pearsonr implementation, available on GitHub.

The correlation coefficient is calculated as:

r = \frac{\sum (x - m_x)(y - m_y)}{\sqrt{\sum (x - m_x)^2 \sum (y - m_y)^2}}

where m_x is the mean of variable x and m_y is the mean of variable y. This formula normalizes the covariance of the two variables by the product of their standard deviations.
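As a minimal sketch of the formula above, the coefficient can be computed with plain Python and no SciPy. The helper name `pearson_r` is illustrative only and is not part of PEARSONR; the data are the values from Example 4.

```python
import math

def pearson_r(x, y):
    """Pearson r computed directly from the definition (illustrative helper)."""
    n = len(x)
    m_x = sum(x) / n  # mean of x
    m_y = sum(y) / n  # mean of y
    # Numerator: sum of products of the deviations from each mean
    num = sum((a - m_x) * (b - m_y) for a, b in zip(x, y))
    # Denominator: square root of the product of the summed squared deviations
    den = math.sqrt(sum((a - m_x) ** 2 for a in x) * sum((b - m_y) ** 2 for b in y))
    return num / den

r = pearson_r([1, 2, 3, 4, 5, 6, 7], [10, 9, 2.5, 6, 4, 3, 2])
print(round(r, 4))  # -0.8285
```

This sketch does not guard against constant input, where the denominator is zero and the coefficient is undefined.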

The function also tests the null hypothesis that the population correlation coefficient is zero (i.e., no linear relationship exists). The p-value is the probability of observing a sample correlation at least as extreme (in absolute value) as the computed one, assuming the null hypothesis is true. If x and y are drawn from independent normal distributions, the sample correlation coefficient follows a beta distribution rescaled to the interval [−1, 1] with shape parameters a = b = n/2 − 1, where n is the sample size; the p-value is computed from this distribution.
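The two-sided p-value described above can be reproduced by hand, as a rough sketch. It assumes the default (exact) method of scipy.stats.pearsonr; the names `null_dist` and `p_manual` are illustrative. `scipy.stats.beta` is defined on [0, 1], so `loc=-1, scale=2` shifts it to [−1, 1].

```python
from scipy.stats import beta, pearsonr

x = [1, 2, 3, 4, 5, 6, 7]
y = [10, 9, 2.5, 6, 4, 3, 2]
n = len(x)

# Reference values from SciPy (the result unpacks as statistic, p-value)
r, p_scipy = pearsonr(x, y)

# Null distribution of r: beta with a = b = n/2 - 1, rescaled to [-1, 1]
null_dist = beta(n / 2 - 1, n / 2 - 1, loc=-1, scale=2)

# Two-sided p-value: probability mass in both tails beyond |r|
p_manual = 2 * null_dist.cdf(-abs(r))

print(float(p_manual), float(p_scipy))
```

The two printed values should agree to within floating-point error.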

A low p-value (typically < 0.05) indicates that a correlation this strong would be unlikely if the population correlation were zero, so the result is conventionally called statistically significant. However, statistical significance does not imply causation, and correlation alone cannot establish a causal relationship between variables.

This example function is provided as-is without any representation of accuracy.

Excel Usage

=PEARSONR(x, y)
  • x (list[list], required): First set of observations (column vector)
  • y (list[list], required): Second set of observations (column vector), same length as x

Returns (list[list]): 2D list [[correlation, p_value]] where correlation is the Pearson correlation coefficient and p_value is the two-tailed p-value for testing non-correlation, or an error message (str) if input is invalid.
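To illustrate the input and output shapes described above, here is a small sketch that calls scipy.stats.pearsonr directly rather than the PEARSONR wrapper. An Excel column such as {1;2;3} corresponds to a list[list] with one single-element row per cell; the variable names are illustrative.

```python
from scipy.stats import pearsonr

# Column vectors: one single-element row per spreadsheet cell
x = [[1], [2], [3], [4], [5], [6], [7]]
y = [[10], [9], [2.5], [6], [4], [3], [2]]

# Flatten each column vector to a 1D list of floats
x_flat = [float(row[0]) for row in x]
y_flat = [float(row[0]) for row in y]

r, p = pearsonr(x_flat, y_flat)

# One row, two columns: [[correlation, p_value]]
result = [[float(r), float(p)]]
print(result)
```

In a spreadsheet, a result of this shape fills two adjacent cells in a single row.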

Examples

Example 1: Perfect positive correlation

Inputs:

x y
1 2
2 4
3 6
4 8

Excel formula:

=PEARSONR({1;2;3;4}, {2;4;6;8})

Expected output:

Result
1 0

Example 2: Perfect negative correlation

Inputs:

x y
1 8
2 6
3 4
4 2

Excel formula:

=PEARSONR({1;2;3;4}, {8;6;4;2})

Expected output:

Result
-1 0

Example 3: Reverse order sequence

Inputs:

x y
1 5
2 4
3 3
4 2
5 1

Excel formula:

=PEARSONR({1;2;3;4;5}, {5;4;3;2;1})

Expected output:

Result
-1 0

Example 4: Real data with strong negative correlation

Inputs:

x y
1 10
2 9
3 2.5
4 6
5 4
6 3
7 2

Excel formula:

=PEARSONR({1;2;3;4;5;6;7}, {10;9;2.5;6;4;3;2})

Expected output:

Result
-0.8285 0.0213

Python Code

from scipy.stats import pearsonr as scipy_pearsonr
import math

def pearsonr(x, y):
    """
    Calculate the Pearson correlation coefficient and p-value for two datasets.

    See: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html

    This example function is provided as-is without any representation of accuracy.

    Args:
        x (list[list]): First set of observations (column vector)
        y (list[list]): Second set of observations (column vector), same length as x

    Returns:
        list[list]: 2D list [[correlation, p_value]] where correlation is the Pearson correlation coefficient and p_value is the two-tailed p-value for testing non-correlation, or an error message (str) if input is invalid.
    """
    # Helper function to normalize input to 2D list
    def to2d(data):
        return [[data]] if not isinstance(data, list) else data

    # Normalize inputs to 2D lists
    x = to2d(x)
    y = to2d(y)

    # Validate dimensions and structure
    # to2d guarantees x and y are lists, so only emptiness and row structure need checking
    if not x or not y or not isinstance(x[0], list) or not isinstance(y[0], list):
        return "Error: x and y must be 2D lists"
    if len(x) < 2 or len(y) < 2:
        return "Error: x and y must have at least two rows"
    if len(x) != len(y):
        return "Error: x and y must have the same length"

    # Extract numeric values and validate
    try:
        x_flat = [float(row[0]) for row in x]
        y_flat = [float(row[0]) for row in y]
    except (ValueError, TypeError, IndexError):
        return "Error: x and y must contain numeric values"

    # Compute correlation and p-value
    try:
        result = scipy_pearsonr(x_flat, y_flat)
        corr = float(result.statistic)
        pval = float(result.pvalue)

        # Check for invalid results
        if math.isnan(corr) or math.isinf(corr) or math.isnan(pval) or math.isinf(pval):
            return "Error: Calculation resulted in non-finite values"

        return [[corr, pval]]
    except Exception as e:
        return f"Error: Error in pearsonr calculation: {str(e)}"
